Abstract. We describe in a rudimentary fashion how SVM (support vector
machine) plays the role of a classifier in a mathematical setting. We then discuss
its application to the study of multiple SNP (single nucleotide polymorphism)
variations. Also presented is a set of preliminary test results with clinical data.

1. Introduction 

It is generally accepted that the causes of biological effects can be divided
into two categories: inheritable (genes from parents) and environmental (food,
gravity, sunlight, surroundings, etc.). In this paper, we focus on inheritable
factors. Our approach to multiple SNP variations is based on the following general
assumptions (for more details, see ||).

1: Suppose all the SNPs are known and there are no environmental factors.
Then each human is uniquely determined by a complete set of SNP variations.

One consequence is that identical twins are exactly the same. Thus it is possible to
classify SNP data sets into several subgroups. Classification (grouping or clustering)
is one of the basic and important generic methods for distinguishing one object from another.

2: To classify the objects we are interested in, the most powerful technique
developed so far is to numericalize them, in other words, to find a way of
representing them as numbers and of collecting those numbers into vectors in a
Euclidean space.

Each of the two assumptions is, on its own, common sense among researchers. The new
twist is that the two have not been considered within the same scope, and that SVM
offers powerful machinery to tackle the problem of classification in a rigorous and
systematic way.

2. Support Vector Machine and Its Analogy 

The concept of SVM (Support Vector Machine) was introduced by Vapnik (|j|) in
the late 1970s. Since then SVM has found applications in many diverse
fields, such as machine learning, gene expression data analysis, and high energy
physics experiments at CERN (European Organization for Nuclear Research). Why has
the idea of SVM been used in such diverse and unrelated fields? The reason is
clear: SVM, based on a solid mathematical foundation, attempts to
solve a universal problem of classification, i.e., deciding which object belongs to
which group. The basic idea of SVM is deceptively simple. Given a collection of
vectors in R^n, labeled +1 or -1, that are separable by a hyperplane, SVM finds the
hyperplane with the maximal margin. More precisely, it finds the hyperplane whose
distance to the closest labeled vectors is maximal. (Vapnik cleverly connected
this distance problem to an optimization problem by using the Kuhn-Tucker conditions,
|3|.) This hyperplane can then be used to determine to which group an unlabeled
vector belongs. This machine fits well with the inductive scientific method.
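
For readers who prefer formulas, the maximal-margin problem described above can be written in its standard textbook form. The notation is an assumption introduced here and does not appear in the original text: m labeled vectors x_i in R^n with labels y_i, a normal vector w, an offset b, and multipliers alpha_i.

```latex
% Maximal-margin hyperplane w . x + b = 0 for separable data
% (x_i, y_i), i = 1, ..., m, with x_i in R^n and y_i in {+1, -1}.
\begin{align*}
  &\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
   \quad\text{subject to}\quad y_i\,(w\cdot x_i + b)\ \ge\ 1,
   \qquad i = 1,\dots,m, \\[4pt]
  &\text{margin between the two classes} \;=\; \frac{2}{\lVert w\rVert}. \\[8pt]
  % Lagrangian with multipliers alpha_i >= 0
  &L(w,b,\alpha) \;=\; \tfrac{1}{2}\lVert w\rVert^{2}
     \;-\; \sum_{i=1}^{m}\alpha_i\,\bigl[\,y_i\,(w\cdot x_i + b) - 1\,\bigr],
     \qquad \alpha_i \ge 0, \\[4pt]
  % Kuhn-Tucker conditions at the optimum
  &w \;=\; \sum_{i=1}^{m}\alpha_i\,y_i\,x_i, \qquad
   \sum_{i=1}^{m}\alpha_i\,y_i \;=\; 0, \qquad
   \alpha_i\,\bigl[\,y_i\,(w\cdot x_i + b) - 1\,\bigr] \;=\; 0, \\[8pt]
  % Classification of an unlabeled vector x
  &f(x) \;=\; \operatorname{sign}\,(w\cdot x + b).
\end{align*}
```

The last Kuhn-Tucker condition says that alpha_i can be nonzero only for vectors lying exactly on the margin, that is, for the closest labeled vectors. These are the support vectors, and they alone determine the hyperplane.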

To give a definite flavor of SVM in everyday experience, let us consider some
familiar concepts: speed limit, height, weight, blood pressure, lipid measurements
in blood, etc. When the speed limit, or the critical values of blood pressure or
lipid measurements for normal people, are determined, people mainly depend on
past experimental data. As a toy model, we consider an analogy, or correspondence,
between finding the speed limit on the road and using a Support Vector Machine
as a criterion to determine an association between a given set of multiple SNP
variations and a disease or trait.

In the mathematical setting, a car speed is a point in R, while a set of numbers
consisting of SNP variations (or any collection of several variables considered at
the same time) is represented as a point in R^n.



    Speed (a single number)            <->  Feature vector (a set of numbers)

    Speed limit (by simple statistic)  <->  Hyperplane of criterion (by support vector machine)
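
The speed-limit side of the analogy can be made concrete. In one dimension, a "hyperplane" is a single threshold, and for separable data the maximal-margin threshold is the midpoint between the largest value labeled -1 and the smallest value labeled +1. The following is a minimal sketch under that assumption. The function names and the sample speeds are hypothetical and are not taken from the paper.

```python
def max_margin_threshold(values, labels):
    """Return (threshold, margin) for 1-D data in which every -1 point
    lies below every +1 point."""
    neg = [v for v, y in zip(values, labels) if y == -1]
    pos = [v for v, y in zip(values, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one example of each label")
    lo, hi = max(neg), min(pos)  # the two closest points: the support vectors
    if lo >= hi:
        raise ValueError("data are not separable by a single threshold")
    threshold = (lo + hi) / 2.0  # midpoint maximizes the distance to both
    margin = (hi - lo) / 2.0     # distance from the threshold to either one
    return threshold, margin


def classify(value, threshold):
    """Label an unseen value by the side of the threshold it falls on."""
    return 1 if value > threshold else -1


# Hypothetical past observations: -1 = judged safe, +1 = judged unsafe.
speeds = [62.0, 75.0, 88.0, 96.0, 104.0, 118.0, 131.0]
labels = [-1, -1, -1, -1, 1, 1, 1]

threshold, margin = max_margin_threshold(speeds, labels)
print(threshold, margin)  # 100.0 4.0
```

Only the two observations nearest the boundary (96 and 104 here) influence the result, which is the one-dimensional counterpart of the support vectors in R^n.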



Conclusion: We conclude that we have to find a way of
representing the SNP variation at each position. This question is open, and the
representation could be adjusted through experiments for better performance (see
footnote 1). Suppose we want to express east, west, south and north (or the DNA
letters A, C, G, T).
Then we may represent them as {(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)}
or {0.2, 0.4, 0.6, 0.8}. This way, at each SNP location, we have a number
depending on the genotype in a consistent way, and together these numbers give us a vector.
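
The two representations mentioned above can be sketched as follows. This is an illustration only: the assignment of each letter to a particular unit vector or number, the use of a single letter per SNP position, and all names in the code are assumptions made here, since the text leaves the representation open.

```python
# Assumed assignment of letters to codes; the paper lists the codes
# but does not say which letter receives which one.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
SCALAR = {"A": 0.2, "C": 0.4, "G": 0.6, "T": 0.8}


def encode_one_hot(letters):
    """Concatenate one unit vector per position: n positions -> R^(4n)."""
    vector = []
    for letter in letters:
        vector.extend(ONE_HOT[letter])
    return vector


def encode_scalar(letters):
    """Use one number per position: n positions -> R^n."""
    return [SCALAR[letter] for letter in letters]


# Hypothetical letters observed at four SNP positions of one individual.
snps = "AGTC"
print(encode_one_hot(snps))
print(encode_scalar(snps))  # [0.2, 0.6, 0.8, 0.4]
```

The unit-vector encoding keeps the four letters equidistant from one another at the cost of four coordinates per position, whereas the scalar encoding is compact but imposes an ordering on the letters. Which works better is the kind of question the text suggests settling by experiment.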

3. Test Results with clinical data 

We generated feature vectors of cardiac patient records by using the same
principle described in Section 2. Height, age, sex, weight, ethnic background, medical
history, birth place, blood pressure (systolic and diastolic), lipid measurements, etc.
were numericalized, and we labeled a patient +1 if he or she had a history of
heart attack, stroke or heart failure, and -1 otherwise. We used Thorsten Joachims'
implementation of SVM, which gave us the following results (see Q and, for a different
implementation, Q). The results indicate that SVM works as intended to
separate the data set into two classes: between 4.1% and 15.3% of the patients in
each test were misclassified.

For a summary of the tests, see Table 1.

1 After we decided to use SVM to classify multiple SNP variations, Honki Kim, a
statistician, pointed out that a classification tree (or decision tree) might work as well.

Test  No. of Patients  +1 labeled  -1 labeled  C (bound)  Misclassified  postoneg  negtopos
1     1000             212         788         1          56             32        24
2     1000             212         788         2          41             23        18
3     1000             409         591         1          153            37        116
4     4000             1055        2945        1          438            168       270

Table 1. Tests with clinical data

1: Postoneg is the number of +1 labeled vectors placed in the group with a -1 labeled
majority, while negtopos is the number of -1 labeled vectors placed in the group with
a +1 labeled majority.

2: Tests 1 and 2 use the same data with different C values.
3: Tests 1 and 3 use different data.
4: The data of Test 3 are contained in the data of Test 4.
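
The counts in Table 1 can be turned into rates with a few lines of arithmetic. The sketch below copies the eight columns of the table and checks that they are internally consistent before dividing. The variable and function names are introduced here for illustration, and the per-class rates follow the reading of postoneg and negtopos given in note 1.

```python
COLUMNS = ("test", "patients", "pos", "neg", "C",
           "misclassified", "postoneg", "negtopos")

# Rows copied from Table 1.
TABLE_1 = [
    (1, 1000, 212, 788, 1, 56, 32, 24),
    (2, 1000, 212, 788, 2, 41, 23, 18),
    (3, 1000, 409, 591, 1, 153, 37, 116),
    (4, 4000, 1055, 2945, 1, 438, 168, 270),
]


def summarize(row):
    """Check one row for consistency and return its error rates."""
    r = dict(zip(COLUMNS, row))
    if r["pos"] + r["neg"] != r["patients"]:
        raise ValueError("label counts do not add up to the number of patients")
    if r["postoneg"] + r["negtopos"] != r["misclassified"]:
        raise ValueError("error counts do not add up to the misclassified total")
    return {
        "test": r["test"],
        # fraction of all patients that were misclassified
        "error_rate": r["misclassified"] / r["patients"],
        # fraction of +1 patients placed in the -1 group
        "postoneg_rate": r["postoneg"] / r["pos"],
        # fraction of -1 patients placed in the +1 group
        "negtopos_rate": r["negtopos"] / r["neg"],
    }


summaries = [summarize(row) for row in TABLE_1]
for s in summaries:
    print("Test {test}: overall {error_rate:.2%}, "
          "+1 missed {postoneg_rate:.2%}, "
          "-1 missed {negtopos_rate:.2%}".format(**s))
```

The overall rates come out as 5.60%, 4.10%, 15.30% and 10.95% for Tests 1 to 4. In Tests 1 and 2 the error rate among +1 patients is noticeably higher than among -1 patients, which the overall figure alone would hide.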



4. Implication 

Support Vector Machines can be applied to the diagnosis of diseases and to adverse
drug effects. If, for each patient, we input all the test results as a vector, the
status of a disease and its prescription could be determined from past disease
records. It should be noted that the data are not limited to numerical ones; they
could include visual data such as X-ray or MRI images, and possibly other sources.
For example, from image data one extracts area, length, topological invariants
and other quantities to add to the totality of input data.
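
As a rough illustration of extracting such quantities from an image, the sketch below treats an image as a small grid of 0s and 1s and computes three numbers: the area (number of foreground pixels), a boundary length (number of pixel edges facing background), and the number of connected regions, which is one simple topological invariant. The choice of these three features, the 4-neighbor connectivity, and all names are assumptions made for this example. Real X-ray or MRI data would first need segmentation, which is not covered here.

```python
def image_features(image):
    """image: list of equal-length rows of 0/1 values.
    Returns [area, boundary_length, number_of_components] as floats."""
    rows, cols = len(image), len(image[0])
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-neighborhood

    def on(r, c):
        # True only for foreground pixels inside the grid.
        return 0 <= r < rows and 0 <= c < cols and image[r][c] == 1

    area = 0
    boundary = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1:
                area += 1
                # every side that faces background or the image border
                boundary += sum(1 for dr, dc in steps if not on(r + dr, c + dc))

    # Count connected regions with an iterative flood fill.
    seen = set()
    components = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1 and (r, c) not in seen:
                components += 1
                seen.add((r, c))
                stack = [(r, c)]
                while stack:
                    cr, cc = stack.pop()
                    for dr, dc in steps:
                        nxt = (cr + dr, cc + dc)
                        if on(*nxt) and nxt not in seen:
                            seen.add(nxt)
                            stack.append(nxt)

    return [float(area), float(boundary), float(components)]


# Hypothetical 3 x 5 binary image with two separate regions.
image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
features = image_features(image)
print(features)  # [6.0, 14.0, 2.0]
```

The resulting list can be appended to the numerical test results of the same patient, so that image-derived and clinical measurements form a single feature vector.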

Due to its generic nature, SVM has already found applications in diverse fields, and
it may find even more applications elsewhere, depending on the user's insight and
intuition (for example, pairing genotypes with phenotypes, or drugs with phenotypes
or genotypes, etc.).